Micro4all workflow
Introduction
This package is intended to record R functions we usually use in our laboratory for the analysis of amplicon data. The following tutorial has been created to show the main utilities of this package, as well as to make public our main workflow. We will cover several steps, including:
- Produce an amplicon sequence variant (ASV) table, making use of dada21 package.
- Study microbial diversity (\(\alpha\) and \(\beta\) diversity).
- Get taxonomical profiles at phylum and genus level.
- Differential abundance analysis with ANCOMBC2 package.
- Ordination graphics and environmental fitting of soil physicochemical parameters.
Dataset
In this tutorial, we will use a bunch of sequences of our own. Specifically, we collected endosphere samples of 8 olive trees (Picual variety) from three different locations (let’s call them location 1, location 2 and location 3) and sequenced the V3-V4 region of the 16S rRNA gene (with Illumina MiSeq technology) in order to study the bacterial community. These paired-end fastq files (already “demultiplexed” and without barcodes) can be download here. In this link you can also find a phyloseq object that will be used in environmental fitting section of this tutorial.
Install package micro4all
Usually, some dependencies are needed for micro4all, like microbiome, BiocGenerics or phyloseq packages. I recommend installing them (if not present before) with BioConductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("microbiome")Let’s load the library
#devtools::install_github("nuriamw/micro4all")
library(micro4all)Produce an amplicon sequence variant (ASV) table
The first part of this tutorial is mainly based in the information provided by dada2 but with our own implementations. We will end this section with an ASV table, already cleaned and ready for diversity analysis.
Prepare data: libraries and read sequences files
library(dada2); packageVersion("dada2")#> [1] '1.20.0'
library(ShortRead);packageVersion("ShortRead")#> [1] '1.50.0'
library(devtools);packageVersion("devtools")#> [1] '2.4.2'
library(ggplot2);packageVersion("ggplot2")#> [1] '3.3.5'
library(tidyverse);packageVersion("tidyverse")#> [1] '1.3.1'
path <- "~/Escritorio/data_micro4all/" # CHANGE ME to the directory containing the fastq files after unzipping.
list.files(path)
#> [1] "ASV_length_nochim.rds" "cutadapt" "figaro"
#> [4] "filtered" "filtN" "L1-1_S102_R1_001.fastq"
#> [7] "L1-1_S102_R2_001.fastq" "L1-2_S112_R1_001.fastq" "L1-2_S112_R2_001.fastq"
#> [10] "L1-3_S121_R1_001.fastq" "L1-3_S121_R2_001.fastq" "L1-4_S130_R1_001.fastq"
#> [13] "L1-4_S130_R2_001.fastq" "L1-5_S138_R1_001.fastq" "L1-5_S138_R2_001.fastq"
#> [16] "L1-6_S145_R1_001.fastq" "L1-6_S145_R2_001.fastq" "L1-7_S152_R1_001.fastq"
#> [19] "L1-7_S152_R2_001.fastq" "L1-8_S159_R1_001.fastq" "L1-8_S159_R2_001.fastq"
#> [22] "L2-1_S139_R1_001.fastq" "L2-1_S139_R2_001.fastq" "L2-2_S146_R1_001.fastq"
#> [25] "L2-2_S146_R2_001.fastq" "L2-3_S153_R1_001.fastq" "L2-3_S153_R2_001.fastq"
#> [28] "L2-4_S160_R1_001.fastq" "L2-4_S160_R2_001.fastq" "L2-5_S104_R1_001.fastq"
#> [31] "L2-5_S104_R2_001.fastq" "L2-6_S114_R1_001.fastq" "L2-6_S114_R2_001.fastq"
#> [34] "L2-7_S123_R1_001.fastq" "L2-7_S123_R2_001.fastq" "L2-8_S132_R1_001.fastq"
#> [37] "L2-8_S132_R2_001.fastq" "L3-1_S136_R1_001.fastq" "L3-1_S136_R2_001.fastq"
#> [40] "L3-2_S143_R1_001.fastq" "L3-2_S143_R2_001.fastq" "L3-3_S150_R1_001.fastq"
#> [43] "L3-3_S150_R2_001.fastq" "L3-4_S157_R1_001.fastq" "L3-4_S157_R2_001.fastq"
#> [46] "L3-5_S101_R1_001.fastq" "L3-5_S101_R2_001.fastq" "L3-6_S111_R1_001.fastq"
#> [49] "L3-6_S111_R2_001.fastq" "L3-7_S120_R1_001.fastq" "L3-7_S120_R2_001.fastq"
#> [52] "L3-8_S129_R1_001.fastq" "L3-8_S129_R2_001.fastq" "MOCK-1_S169_R1_001.fastq"
#> [55] "MOCK-1_S169_R2_001.fastq" "MOCK-2_S176_R1_001.fastq" "MOCK-2_S176_R2_001.fastq"
#> [58] "MOCK-3_S183_R1_001.fastq" "MOCK-3_S183_R2_001.fastq" "nochim"
#> [61] "num_seqs.txt" "rdp_train_set_18_H.fa" "tabla_otu_inicial.rds"
#> [64] "taxa_rdp_16S.rds"
## SORT FORWARD AND REVERSE READS SEPARETELY; forward and reverse fastq filenames have format: SAMPLENAME_R1_001.fastq and SAMPLENAME_R2_001.fastq ##
fnFs <- sort(list.files(path, pattern="_R1_001.fastq", full.names = TRUE))
fnRs <- sort(list.files(path, pattern="_R2_001.fastq", full.names = TRUE))
# Extract sample names, assuming filenames have format: SAMPLENAME_XXX.fastq
sample.names <- sapply(strsplit(basename(fnFs), "_"), `[`, 1)📝TIP: If your file names are more complex and you need to remove some previous codes, you can always use gsub over sample.names or strsplit again.
Check sequences quality
A crucial step in any sequencing analysis pipeline is quality checking and we will use several approaches. First, we will generate a table with the number of reads in each sample, just to check if anything went wrong during sequencing process.
##COUNT NUMBER OF READS IN EACH SAMPLE BEFORE FILTERING
raw_reads_count <- NULL
for (i in 1:length(fnFs)){
raw_reads_count <- rbind(raw_reads_count, length(ShortRead::readFastq(fnFs[i])))}
#Format table
rownames(raw_reads_count)<- sample.names
colnames(raw_reads_count)<- "Number of reads"
#Get min and max number of sequences
min(raw_reads_count)#> [1] 66426
max(raw_reads_count)#> [1] 125873
As we can see, even the minimum number of sequences is good enough for our analysis. We can also check with a histogram the distribution of sequences length.
reads <- ShortRead::readFastq(fnFs)
#Get lengths with unique
counts= NULL
uniques <- unique(reads@quality@quality@ranges@width)
#count number of sequences for each length
for (i in 1:length(uniques)) {
counts<- rbind(counts,length(which(reads@quality@quality@ranges@width==uniques[i])))
}
#format histogram table
histogram <- cbind(uniques,counts)
colnames(histogram) <- c("Seq.length", "counts")
#Check histogram matrix
head(histogram[order(histogram[,1],decreasing = TRUE),]) #Most sequences fall in expected sequence length#> Seq.length counts
#> [1,] 301 2154642
#> [2,] 300 71753
#> [3,] 299 16251
#> [4,] 298 61651
#> [5,] 297 10119
#> [6,] 296 377
# PLOT HISTOGRAM
hist(reads@quality@quality@ranges@width, main="Forward length distribution", xlab="Sequence length", ylab="Raw reads")ITS considerations
Because of the variable length of ITS region, this step provides little information when analyzing ITS amplicon sequences.
Moreover, plotQualityProfile function of dada2 is very useful for a quick visualization of sequences quality profile.
## VIEW AND SAVE QUALITY PLOT FOR FW AND RV ##
plotQualityProfile(fnFs[1:2])#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
plotQualityProfile(fnRs[1:2])#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
Figaro: selection of trimming parameters
In this tutorial, we will use the Figaro3 tool. This tool makes the selection of trimming parameters for dada2 more objective. Sadly, Figaro has not yet been updated to lead with variations in sequence lengths. We will solve this following author’s recommendation (which you can find here). It consist on performing a pre-trimming step at about 295 bp. Let’s see how it works!
##FIGARo
##FILTER READS
figFs <- file.path(path, "figaro", basename(fnFs))
figRs <- file.path(path, "figaro", basename(fnRs))
##TRIMMING AT 295 pb
out.figaro <- filterAndTrim(fnFs, figFs, fnRs, figRs,
compress=TRUE, multithread=TRUE, truncLen=c(295,295)) #
##RUN FIGARO
figaro <- system(("python3 /home/programs/figaro/figaro/figaro.py -i ~/Escritorio/data_micro4all/figaro -o ~/Escritorio/data_micro4all/figaro -a 426 -f 17 -r 21"),
intern=TRUE) #path to figaro program -i path to input - path to output -a length of your amplicon without primers, -f primer forward length, -r primer reverse length
head(figaro)#> [1] "Forward read length: 295"
#> [2] "Reverse read length: 295"
#> [3] "{\"trimPosition\": [279, 205], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.69, \"score\": 76.68937485688116}"
#> [4] "{\"trimPosition\": [278, 206], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.68, \"score\": 76.6750629722922}"
#> [5] "{\"trimPosition\": [277, 207], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.67, \"score\": 76.66838409281735}"
#> [6] "{\"trimPosition\": [280, 204], \"maxExpectedError\": [3, 2], \"readRetentionPercent\": 81.66, \"score\": 76.6617052133425}"
We will select the first result, which gives us the *trimming positions** and *maximum expected errors that produces the greater score according to Fígaro** model.
ITS considerations
As mentioned before, ITS region has very variable length. Consequently, no trimming step is applied to these sequences and it would not be necessary to use Figaro.
Filter and trim sequences
We will apply Fígaro parameters to filterAndTrim function
### FILTER AND TRIM SEQUENCES ACCORDING TO FIGARO ####
## Place filtered files in filtered/ subdirectory
filtFs <- file.path(path, "filtered", basename(fnFs))
filtRs <- file.path(path, "filtered", basename(fnRs))
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs, truncLen=c(279,205),
maxN=0, maxEE=c(3,2), truncQ=2, rm.phix=TRUE,
compress=TRUE, multithread=TRUE, minLen=50)
head(out)#> reads.in reads.out
#> L1-1_S102_R1_001.fastq 80054 61741
#> L1-2_S112_R1_001.fastq 76776 60781
#> L1-3_S121_R1_001.fastq 84947 67563
#> L1-4_S130_R1_001.fastq 87190 68997
#> L1-5_S138_R1_001.fastq 114810 93159
#> L1-6_S145_R1_001.fastq 88519 70075
Cutadapt: efficient removal of primers
Now that we have run Figaro and performed quality trimming, we can follow with primer removing. This will be done with cutapat4 tool. Cutadapt should not be applied before Figaro because it can modify the quality profiles in which Figaro is based.
#### IDENTIFY PRIMERS ####
FWD <- "CCTACGGGNBGCASCAG" ## CHANGE ME to your forward primer sequence
REV <- "GACTACNVGGGTATCTAATCC" ## CHANGE ME to your reverse primer sequence
## VERIFY PRESENCE AND ORENTATION OF PRIMERS ##
allOrients <- function(primer) {
# Create all orientations of the input sequence
require(Biostrings)
dna <- DNAString(primer) # The Biostrings works w/ DNAString objects rather than character vectors
orients <- c(Forward = dna, Complement = Biostrings::complement(dna), Reverse = reverse(dna),
RevComp = reverseComplement(dna))
return(sapply(orients, toString)) # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients#> Forward Complement Reverse RevComp
#> "CCTACGGGNBGCASCAG" "GGATGCCCNVCGTSGTC" "GACSACGBNGGGCATCC" "CTGSTGCVNCCCGTAGG"
REV.orients#> Forward Complement Reverse RevComp
#> "GACTACNVGGGTATCTAATCC" "CTGATGNBCCCATAGATTAGG" "CCTAATCTATGGGVNCATCAG" "GGATTAGATACCCBNGTAGTC"
## COUNT THE APPEARENCE AND ORIENTATION OF PRIMERS ##
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = filtFs[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = filtRs[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = filtFs[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = filtRs[[1]]))#> Forward Complement Reverse RevComp
#> FWD.ForwardReads 59926 0 0 0
#> FWD.ReverseReads 0 0 0 0
#> REV.ForwardReads 2 0 0 0
#> REV.ReverseReads 60270 0 0 0
At this point, and as it is explained in dada2 ITS tutorial, we found what we expected to: the FWD primer appears in the forward reads in its forward orientation, but it also show up in some reverse reads in its reverse-complement orientation. Something similar occurs with reverse primer.
Now, let’s run cutadapt. We use argument ^ for anchoring primers. Moreover, we combine it with the –discard-untrimed argument. In this way, we make sure that a sequence is not retained if only the reverse primer is found. Forward and reverse primers should be present in order to keep the sequences.
cutadapt <- "/usr/local/bin/cutadapt" #path to cutadapt
system2(cutadapt, args = c("--version")) # Run shell commands from R
##Create path to cutadapt sequences
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(filtFs))
fnRs.cut <- file.path(path.cut, basename(filtRs))
##Produce arguments for cutadapt
FWD.RC <- dada2:::rc(FWD)
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste0("-a", " ", "^",FWD,"...", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste0("-A"," ","^", REV, "...", FWD.RC)
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2,"-m", 1, # -n 2 required to remove FWD and REV from reads
#-m 1 is required to remove empty sequences for plotting quality plots
"--discard-untrimmed",
"-j",0,#automatically detect number of cores
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
filtFs[i], filtRs[i],# input files
"--report=minimal")) #Report minimal reports a summary
}Let’s see if cutadapt has properly removed the primers
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))#> Forward Complement Reverse RevComp
#> FWD.ForwardReads 1 0 0 0
#> FWD.ReverseReads 0 0 0 0
#> REV.ForwardReads 0 0 0 0
#> REV.ReverseReads 0 0 0 0
cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq", full.names = TRUE))
cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq", full.names = TRUE))As we can see, there are still some forward primers to be found on forward reads. We should not worry about that. Cutadapt and primerHits have different ways to look for primers in the sequence. Moreover, if primers are still to be found in low quality sequences, they will probably be removed in next steps.
To continue with, we will apply the main core of DADA2 package. As it is very well detailed in its tutorial, we will not make a lot of comments about it
Learn error rates
#### LEARN ERROR RATES ####
errF <- learnErrors(fnFs.cut, multithread=T, verbose=1 ) ##> 109967023 total bases in 419736 reads from 6 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ......
#> selfConsist step 2
#> selfConsist step 3
#> selfConsist step 4
#> selfConsist step 5
#> Convergence after 5 rounds.
errR <- learnErrors(fnRs.cut, multithread=T, verbose=1) ##> 102040114 total bases in 554558 reads from 8 samples will be used for learning the error rates.
#> Initializing error rates to maximum possible estimate.
#> selfConsist step 1 ........
#> selfConsist step 2
#> selfConsist step 3
#> selfConsist step 4
#> selfConsist step 5
#> selfConsist step 6
#> Convergence after 6 rounds.
View error plots
As we can see, the plots follow the tendency adequately.
plotErrors(errF, nominalQ=TRUE)#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Transformation introduced infinite values in continuous y-axis
plotErrors(errR, nominalQ=TRUE)#> Warning: Transformation introduced infinite values in continuous y-axis
#> Warning: Transformation introduced infinite values in continuous y-axis
Sample inference
dadaFs <- dada(fnFs.cut, err=errF, multithread=TRUE)#> Sample 1 - 61506 reads in 15424 unique sequences.
#> Sample 2 - 60287 reads in 12454 unique sequences.
#> Sample 3 - 67108 reads in 13471 unique sequences.
#> Sample 4 - 68541 reads in 13392 unique sequences.
#> Sample 5 - 92672 reads in 19174 unique sequences.
#> Sample 6 - 69622 reads in 14664 unique sequences.
#> Sample 7 - 66191 reads in 14858 unique sequences.
#> Sample 8 - 68631 reads in 17064 unique sequences.
#> Sample 9 - 64360 reads in 12997 unique sequences.
#> Sample 10 - 73537 reads in 16302 unique sequences.
#> Sample 11 - 64206 reads in 12105 unique sequences.
#> Sample 12 - 74353 reads in 16945 unique sequences.
#> Sample 13 - 62585 reads in 14748 unique sequences.
#> Sample 14 - 72255 reads in 16401 unique sequences.
#> Sample 15 - 69370 reads in 15705 unique sequences.
#> Sample 16 - 56068 reads in 13202 unique sequences.
#> Sample 17 - 73197 reads in 14786 unique sequences.
#> Sample 18 - 58236 reads in 12100 unique sequences.
#> Sample 19 - 56801 reads in 13257 unique sequences.
#> Sample 20 - 56072 reads in 12635 unique sequences.
#> Sample 21 - 57847 reads in 12828 unique sequences.
#> Sample 22 - 65716 reads in 11388 unique sequences.
#> Sample 23 - 69528 reads in 13095 unique sequences.
#> Sample 24 - 52479 reads in 9714 unique sequences.
#> Sample 25 - 89584 reads in 7824 unique sequences.
#> Sample 26 - 93427 reads in 8209 unique sequences.
#> Sample 27 - 106500 reads in 9724 unique sequences.
dadaRs <- dada(fnRs.cut, err=errR, multithread=TRUE)#> Sample 1 - 61506 reads in 15090 unique sequences.
#> Sample 2 - 60287 reads in 8972 unique sequences.
#> Sample 3 - 67108 reads in 10240 unique sequences.
#> Sample 4 - 68541 reads in 11098 unique sequences.
#> Sample 5 - 92672 reads in 15470 unique sequences.
#> Sample 6 - 69622 reads in 11810 unique sequences.
#> Sample 7 - 66191 reads in 10626 unique sequences.
#> Sample 8 - 68631 reads in 13787 unique sequences.
#> Sample 9 - 64360 reads in 9862 unique sequences.
#> Sample 10 - 73537 reads in 11339 unique sequences.
#> Sample 11 - 64206 reads in 8219 unique sequences.
#> Sample 12 - 74353 reads in 12900 unique sequences.
#> Sample 13 - 62585 reads in 11369 unique sequences.
#> Sample 14 - 72255 reads in 12843 unique sequences.
#> Sample 15 - 69370 reads in 12557 unique sequences.
#> Sample 16 - 56068 reads in 11103 unique sequences.
#> Sample 17 - 73197 reads in 12138 unique sequences.
#> Sample 18 - 58236 reads in 9899 unique sequences.
#> Sample 19 - 56801 reads in 9540 unique sequences.
#> Sample 20 - 56072 reads in 10430 unique sequences.
#> Sample 21 - 57847 reads in 9716 unique sequences.
#> Sample 22 - 65716 reads in 9249 unique sequences.
#> Sample 23 - 69528 reads in 11676 unique sequences.
#> Sample 24 - 52479 reads in 8641 unique sequences.
#> Sample 25 - 89584 reads in 6378 unique sequences.
#> Sample 26 - 93427 reads in 6198 unique sequences.
#> Sample 27 - 106500 reads in 7362 unique sequences.
dadaFs[[1]]#> dada-class: object describing DADA2 denoising results
#> 758 sequence variants were inferred from 15424 input unique sequences.
#> Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
names(dadaFs) <- sample.names
names(dadaRs) <- sample.namesMerge paired-end reads
mergers <- mergePairs(dadaFs, fnFs.cut, dadaRs, fnRs.cut, verbose=TRUE)#> 58056 paired-reads (in 841 unique pairings) successfully merged out of 59510 (in 1465 pairings) input.
#> 57884 paired-reads (in 527 unique pairings) successfully merged out of 58969 (in 973 pairings) input.
#> 65180 paired-reads (in 492 unique pairings) successfully merged out of 66042 (in 881 pairings) input.
#> 66703 paired-reads (in 486 unique pairings) successfully merged out of 67552 (in 865 pairings) input.
#> 89414 paired-reads (in 932 unique pairings) successfully merged out of 90777 (in 1561 pairings) input.
#> 66897 paired-reads (in 665 unique pairings) successfully merged out of 68102 (in 1223 pairings) input.
#> 63906 paired-reads (in 609 unique pairings) successfully merged out of 64927 (in 1106 pairings) input.
#> 65300 paired-reads (in 860 unique pairings) successfully merged out of 66704 (in 1524 pairings) input.
#> 62656 paired-reads (in 606 unique pairings) successfully merged out of 63300 (in 889 pairings) input.
#> 71291 paired-reads (in 635 unique pairings) successfully merged out of 72263 (in 1082 pairings) input.
#> 62506 paired-reads (in 443 unique pairings) successfully merged out of 63204 (in 757 pairings) input.
#> 71825 paired-reads (in 733 unique pairings) successfully merged out of 72805 (in 1226 pairings) input.
#> 59867 paired-reads (in 626 unique pairings) successfully merged out of 61145 (in 1295 pairings) input.
#> 69397 paired-reads (in 807 unique pairings) successfully merged out of 70680 (in 1361 pairings) input.
#> 66800 paired-reads (in 773 unique pairings) successfully merged out of 67865 (in 1198 pairings) input.
#> 53498 paired-reads (in 691 unique pairings) successfully merged out of 54585 (in 1149 pairings) input.
#> 70984 paired-reads (in 668 unique pairings) successfully merged out of 71999 (in 1094 pairings) input.
#> 56672 paired-reads (in 449 unique pairings) successfully merged out of 57345 (in 744 pairings) input.
#> 54758 paired-reads (in 705 unique pairings) successfully merged out of 55509 (in 1006 pairings) input.
#> 54202 paired-reads (in 648 unique pairings) successfully merged out of 54911 (in 986 pairings) input.
#> 55901 paired-reads (in 585 unique pairings) successfully merged out of 56778 (in 1000 pairings) input.
#> 64397 paired-reads (in 475 unique pairings) successfully merged out of 65119 (in 738 pairings) input.
#> 67867 paired-reads (in 532 unique pairings) successfully merged out of 68681 (in 897 pairings) input.
#> 51487 paired-reads (in 353 unique pairings) successfully merged out of 51953 (in 540 pairings) input.
#> 89364 paired-reads (in 74 unique pairings) successfully merged out of 89388 (in 85 pairings) input.
#> 93212 paired-reads (in 82 unique pairings) successfully merged out of 93267 (in 96 pairings) input.
#> 106318 paired-reads (in 139 unique pairings) successfully merged out of 106393 (in 160 pairings) input.
# Inspect the merger data.frame from the first sample
head(mergers[[1]])#> sequence
#> 1 TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCGCGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 2 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGCTTTCGGGTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 3 TGGGGAATATTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGAAGAAGCGAAAGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTCTGTCACGTCGGATGTGAAAGCCCGGGGCTTAACCCCGGGTCTGCATTCGATACGGGCAGACTAGAGTGTGGTAGGGGAGATCGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGATCTCTGGGCCATTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 4 TGGGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCGCCAGGGACGAAGGGTGACTGACGGTACCTGGAGAAGAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCGAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCGGCCGTGAAAACTTCACGCTTAACGTGGAGCTTGCGGTCGATACGGGCAGACTTGAGTTCGGTAGGGGAGACTGGAATTCCTGGTGTAGCGGTGAAATGCGCAGATATCAGGAGGAACACCGGTGGCGAAGGCGGGTCTCTGGGCCGATACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 5 TGGGGAATATTGCGCAATGGGCGGAAGCCTGACGCAGCGACGCCGCGTGGGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCTAACGTGACGGTACCTGCAGAAGAAGCGCCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGCGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGAGCTCGTAGGCGGTTTGTTGCGTCCGTCGTGAAAGCCCACGGCTTAACTGTGGGTCTGCGGTGGATACGGGCAGACTAGAGGCAGGTAGGGGAGCATGGAATTCCCGGTGTAGCGGTGAAATGCGCAGATATCGGGAGGAACACCGGTGGCGAAGGCGGTGCTCTGGGCCTGTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA
#> 6 TGGGGAATCTTGGACAATGGGCGAAAGCCCGATCCAGCAATATCGCGTGAGTGAAGAAGGGCAATGCCGCTTGTAAAGCTCTTTCATCGAGTGCGCGATCATGACAAGACTCGAGGAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAAGACGGGGGGGGCAAGTGTTCTTCGGAATGACTGGGCGTAAAGGGCACGTAGGCGGTGAATCGGGTTGAAAGTGAAAGTCGCCAAAAACTGGCGGAATGCTCTCGAAACCAATTCACTTGAGTGAGACAGAGGAGAGTGGAATTTCGTGTGTAGGGGTGAAATCCGGAGATCTACGAAGGAACGCCAAAAGCGAAGGCAGCTCTCTGGGTCCCTACCGACGCTGGGGTGCGAAAGCATGGGGAGCGAACA
#> abundance forward reverse nmatch nmismatch nindel prefer accept
#> 1 6997 1 1 39 0 0 2 TRUE
#> 2 5696 2 2 37 0 0 2 TRUE
#> 3 3738 3 1 39 0 0 2 TRUE
#> 4 2282 4 2 39 0 0 2 TRUE
#> 5 1950 5 5 39 0 0 2 TRUE
#> 6 1931 6 3 42 0 0 2 TRUE
Construct sequence table
seqtab <- makeSequenceTable(mergers)
dim(seqtab) #> [1] 27 7080
Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=TRUE, verbose=TRUE)#> Identified 2163 bimeras out of 7080 input sequences.
dim(seqtab.nochim)#> [1] 27 4917
Inspect ASV length and abundance
table(nchar(getSequences(seqtab.nochim))) #Number of ASV of each length#>
#> 244 247 253 255 259 262 264 269 273 276 299 304 311 316 321 327 338 339 356
#> 2 2 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 1 1
#> 359 361 364 369 370 373 376 380 382 384 385 386 390 396 397 400 401 402 403
#> 1 1 2 1 1 1 1 1 2 2 6 8 1 5 2 4 44 984 294
#> 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422
#> 202 156 20 387 82 44 24 20 3 11 7 5 50 16 18 32 27 216 552
#> 423 424 425 426 427 428 429 430 431 432
#> 40 29 57 265 1026 218 14 3 2 6
This is useful, but we also want to know the abundance of sequences for each length. Let’s calculate it and make a plot
reads.per.seqlen <- tapply(colSums(seqtab.nochim), factor(nchar(getSequences(seqtab.nochim))), sum) #number of sequences for each length
reads.per.seqlen#> 244 247 253 255 259 262 264 269 273 276 299 304 311
#> 4 104 2 4 2 175 6 3 2 2 2 4 2
#> 316 321 327 338 339 356 359 361 364 369 370 373 376
#> 2 2 4 2 2 2 2 4 50 4 2 6 4
#> 380 382 384 385 386 390 396 397 400 401 402 403 404
#> 4 36451 8 40 1082 2 15 13 661 504 130086 99017 277172
#> 405 406 407 408 409 410 411 412 413 414 415 416 417
#> 3998 995 592438 116044 58285 1601 1328 4790 179 2455 93 2878 133
#> 418 419 420 421 422 423 424 425 426 427 428 429 430
#> 612 5778 1023 12092 29632 268 448 4098 12680 370689 13647 119 8
#> 431 432
#> 4 213
## Plot length distribution
table_reads_seqlen <- data.frame(length=as.numeric(names(reads.per.seqlen)), count=reads.per.seqlen)
ggplot(data=table_reads_seqlen, aes(x=length, y=count)) + geom_col()According to previous information, we will choose the interval ranging from 402 to 428 pb
seqtab.nochim <- seqtab.nochim[,nchar(colnames(seqtab.nochim)) %in% seq(402,428)]Check number of reads
At this point, it is important to check how many sequences we have lost in each step. Apart from filtering, not other step should have led to a substantial decreased in sequence number.
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergers, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
rownames(track) <- sample.names
#Let's apply a mutate to get percentage, making it easier to analyze.
track <- track %>%as.data.frame() %>% mutate(Perc.input=filtered*100/input,
Perc.denoisedF= denoisedF*100/filtered,
Perc.denoisedR= denoisedR*100/filtered,
Perc.merged= merged*100/filtered,
Perc.nonchim= nonchim*100/merged)
head(track)#> input filtered denoisedF denoisedR merged nonchim Perc.input Perc.denoisedF
#> L1-1 80054 61741 60003 60662 58056 55455 77.12419 97.18501
#> L1-2 76776 60781 59308 59694 57884 56153 79.16667 97.57655
#> L1-3 84947 67563 66299 66670 65180 63021 79.53548 98.12915
#> L1-4 87190 68997 67828 68092 66703 63831 79.13408 98.30572
#> L1-5 114810 93159 91321 91801 89414 84491 81.14189 98.02703
#> L1-6 88519 70075 68479 68988 66897 64905 79.16380 97.72244
#> Perc.denoisedR Perc.merged Perc.nonchim
#> L1-1 98.25238 94.03152 95.51984
#> L1-2 98.21161 95.23371 97.00954
#> L1-3 98.67827 96.47292 96.68763
#> L1-4 98.68835 96.67522 95.69435
#> L1-5 98.54228 95.97999 94.49415
#> L1-6 98.44880 95.46486 97.02229
Everything is as expected, so let’s keep on with classification!
Taxonomy classification with RDP
taxa_rdp <- assignTaxonomy(seqtab.nochim, "~/Escritorio/data_micro4all/rdp_train_set_18_H.fa", multithread=TRUE)Get ASV table
We will produce a raw ASV table that can be saved as .txt. This allows us to manipulate the data manually if needed, as well as saving a raw data before further steps are performed.
ASV <- seqtab.nochim
ASVt <- t(ASV)
## SUBSTITUTE 'NA' WITH 'UNCLASSIFIED'and remove species column
taxa_rdp_na <- apply(taxa_rdp,2, tidyr::replace_na, "unclassified")[,-7]
##Create names for each ASV: if there are 1000 ASVs, call them from ASV0001 to ASV1000
number.digit<- nchar(as.integer(nrow(ASVt)))
names <- paste0("ASV%0", number.digit, "d") #As many 0 as digits
ASV_names<- sprintf(names, 1:nrow(ASVt))
## CREATE AND SAVE ASV TABLE
ASV_table_classified_raw<- cbind(as.data.frame(taxa_rdp_na,stringsAsFactors = FALSE),as.data.frame(ASV_names, stringsAsFactors = FALSE),as.data.frame(ASVt,stringsAsFactors = FALSE))
##Make rownames a new column, in order to keep sequences during the filtering process
ASV_seqs <- rownames(ASV_table_classified_raw)
rownames(ASV_table_classified_raw) <- NULL
ASV_table_classified_raw <- cbind(ASV_seqs, ASV_table_classified_raw)MOCK community
In order to filter out sporious ASV, we always introduce in our sequencing three samples of a commercial MOCK community (ZymoBIOMICS Microbial Community Standard II (Log Distribution), ZYMORESEARCH, CA, USA). Because we already know its composition, we can check if there is any ASV identified by the sequencing process that is not a real member of the community. Once identified, we calculate the percentage of sequences it represents. We make the assumption that ASVs that represent less than this percentage should be discarded from our data. In order to automate this analysis, micro4all implements a function called MockCommunity. Let’s see how it works:
ASV_filtered_sorted<-MockCommunity(ASV_table_classified_raw, mock_composition, ASV_column = "ASV_names") #mock_composition is a data frame included as data in this package. See documentation for details.#> ASV0811 does not belong to the MOCK community. It representes a 0.031757 perc. of the sequences
#> and it classifies as Limosilactobacillus
#> Do you want to use this ASV to calculate the percentage? [answer yes or no]ASV0002 does not belong to the MOCK community. It representes a 0.003881 perc. of the sequences
#> and it classifies as Streptomyces
#> Do you want to use this ASV to calculate the percentage? [answer yes or no]You made a decision! ASV0002 which classifies as Streptomyces
#> and represents a 0.003881 perc. of the sequences, was used to calculate the percentage
MockCommunity makes a question to user before making the decision. Why? Sometimes, like in the example above, an ASV could be classified with a different name, still being the same genera. In this example, the first ASV which is not detected in the MOCK community composition table, is Limosilactobacillus. However, we can see that it represents a high percentage of sequences. This is probably a bacillus from MOCK samples. Because of that, we decide to choose the second option.
Once the percentage is applied, we can remove sequences coming from chloroplast or mitochondria.
ASV_final_table<-ASV_filtered_sorted[(which(ASV_filtered_sorted$Genus!="Streptophyta"
& ASV_filtered_sorted$Genus!="Chlorophyta"
&ASV_filtered_sorted$Genus!="Bacillariophyta"
&ASV_filtered_sorted$Family!="Streptophyta"
& ASV_filtered_sorted$Family!="Chlorophyta"
&ASV_filtered_sorted$Family!="Bacillariophyta"
& ASV_filtered_sorted$Family!="Mitochondria"
& ASV_filtered_sorted$Class!="Chloroplast"
& ASV_filtered_sorted$Order!="Chloroplast"
& ASV_filtered_sorted$Kingdom!="Eukaryota"
& ASV_filtered_sorted$Kingdom!="unclassified")),]After this, the presence of Cyanobacteria/Chloroplast at phylum level was checked. There were only two ASV classified as Cyanobacteria/Chloroplast at phylum level which were not classified at deeper levels, so we remove them, as follow:
ASV_final_table<-ASV_final_table[(which(ASV_final_table$Phylum!="Cyanobacteria/Chloroplast"
)),]Diversity analysis
We almost have all components to perform a full diversity analysis. For this purpose, we need a phyloseq object. Before that, we should produce a phylogenetic tree, which allows us to use UniFrac and WeightedUnifrac distances later on. We will use programs as MAFFT5 and FastTreeMP6.
library(seqinr)## GET THE FASTA FILE ##
ASVseqs_MOCK <- as.list(ASV_final_table$ASV_seqs)
write.fasta(ASVseqs_MOCK, names=ASV_final_table$ASV_names, file.out="ASV_tree.fas", open = "w", nbchar =1000 , as.string = FALSE)
##Get the alignment ##
mafft <- "/usr/bin/mafft" #path to program
system2(mafft, args="--version")
system2(mafft, args=c("--auto", "ASV_tree.fas>", "alignment"))
## Get the tree ##
FastTreeMP <- "/home/programs/FastTreeMP/FastTreeMP"
system2(FastTreeMP, args="--version" )
system2(FastTreeMP, args = c("-gamma", "-nt", "-gtr", "-spr",4 ,"-mlacc", 2, "-slownni", "<alignment>", "tree"))# Run FastTreeMP
##Introduce phylogenetic tree
tree <- phyloseq::read_tree("tree")ITS considerations
ITS region is not appropriate for calculating phylogenetic trees. As a result, no Unifrac or Weighted Unifrac distances can be applied.
Create phyloseq
We will put all the information we have (classification, abundance, metadata, ASV sequences and phylogenetic tree) in one object thanks to phyloseq7. We get metadata table from micro4allpackage, as it is stored as data, but it simply consist on a data frame with sample names and location. Later on, this will allow us to group samples by location.
##Load packages
library(phyloseq); packageVersion("phyloseq")#> [1] '1.36.0'
library(Biostrings); packageVersion("Biostrings")#> [1] '2.60.2'
library(GUniFrac); packageVersion("GUniFrac")#> [1] '1.3'
library(phangorn); packageVersion("phangorn")#> [1] '2.7.1'
library(vegan); packageVersion("vegan")#> [1] '2.5.7'
library(pheatmap); packageVersion("pheatmap")#> [1] '1.0.12'
library(colorspace); packageVersion("colorspace")#> [1] '2.0.2'
####CREATE PHYLOSEQ OBJECT FROM TABLES
##Get tax, otu and sequences
tax <- ASV_final_table[,2:8] #Tax
OTU <- ASV_final_table[,9:ncol(ASV_final_table)] #ASV
colnames(OTU) <- str_replace_all(colnames(OTU),"-", "_")
dna<-Biostrings::DNAStringSet(ASV_final_table$ASV_seqs) #Sequences
names(dna)<- ASV_final_table$ASV_names
##ADD ASV NAMES
row.names(tax)<-ASV_final_table$ASV_names
row.names(OTU)<-ASV_final_table$ASV_names
#Check rownames are equal
identical(rownames(OTU), rownames(tax))#> [1] TRUE
##CONVERT TO PHYLOSEQ FORMART
metadata.micro <-micro4all::metadata
metadata.micro$samples <- str_replace_all(metadata.micro$samples,"-", "_")
rownames(metadata.micro) <- str_replace_all(rownames(metadata.micro),"-", "_")
as.data.frame(gsub("-",replacement = ".", metadata.micro))#> gsub("-", replacement = ".", metadata.micro)
#> 1 c("L1_1", "L1_2", "L1_3", "L1_4", "L1_5", "L1_6", "L1_7", "L1_8", "L2_1", "L2_2", "L2_3", "L2_4", "L2_5", "L2_6", "L2_7", "L2_8", "L3_1", "L3_2", "L3_3", "L3_4", "L3_5", "L3_6", "L3_7", "L3_8")
#> 2 c("location1", "location1", "location1", "location1", "location1", "location1", "location1", "location1", "location2", "location2", "location2", "location2", "location2", "location2", "location2", "location2", "location3", "location3", "location3", "location3", "location3", "location3", "location3", "location3")
phy_OTUtable<-otu_table(OTU, taxa_are_rows = T)
phy_taxonomy<-tax_table(as.matrix(tax))
phy_metadata<-sample_data(metadata.micro)
##Introduce phylogenetic tree
unrooted_tree<- tree
is.rooted(unrooted_tree)#> [1] FALSE
##Produce root
tree_root<-root(unrooted_tree, 1, resolve.root = T)
tree_root#>
#> Phylogenetic tree with 1035 tips and 1034 internal nodes.
#>
#> Tip labels:
#> ASV0552, ASV0148, ASV1054, ASV0044, ASV0438, ASV0153, ...
#> Node labels:
#> Root, , 0.000, 0.000, 0.886, 0.807, ...
#>
#> Rooted; includes branch lengths.
is.rooted(tree_root)#> [1] TRUE
##Add tree to phyloseq
location_phyloseq<-phyloseq(phy_OTUtable,phy_taxonomy,phy_metadata,dna,tree_root)\(\alpha\) diversity
Data preparation
First, we will produce a table with \(\alpha\) diversity indices. For this purpose, we will rarefy the data to the lowest sequence number. This is done in order to avoid an over estimation of diversity only because of a greater number of sequences.
Rarefaction curves should be produced first. This makes it easy to visualize if rarefaction process encountered a great loss of information or not. This plot represents number of ASV (richness) against number of sequences. Moreover, a vertical line will show the number of sequences where normalization occurred. If this line is found in the asymptote of every rarefaction curve, that means we are safe: although we would have more sequences for every sample, we would not have a significant greater number of ASV.
ASV <- as.data.frame(t(otu_table(location_phyloseq)))
sample_names <- rownames(ASV)
#Generate rarefaction curves
rarecurve <- rarecurve(ASV, step = 100, label = F)We will make some preparation for ggplot
#For each rarefaction curve, transform rarecurve output to a dataframe.
rarecurve_table <- lapply(rarecurve, function(x){
b <- as.data.frame(x)
b <- data.frame(ASV = b[,1], raw.read = rownames(b))
b$raw.read <- as.numeric(gsub("N", "", b$raw.read))
return(b)
})
#Produce a column with sample names and put everything in one data frame
names(rarecurve_table) <- sample_names
rarecurve_table <- purrr::map_dfr(rarecurve_table, function(x){
z <- data.frame(x)
return(z)
}, .id = "Sample")
#To color lines according to group, let's create a new column
#Coloring
color <- NULL
for (i in (1:length(rarecurve_table$Sample))) {
color <- rbind(color,metadata.micro$location[which(metadata.micro$samples==rarecurve_table$Sample[i])])##Change "Location" to variable of interest
}
#Bind this column
rarecurve_table_color <- cbind(rarecurve_table, color)
colnames(rarecurve_table_color) <- c(colnames(rarecurve_table), "Location")
## RARECURVE WITH GGPLOT ##
rareggplot<-ggplot(rarecurve_table_color, aes(x=raw.read, y=ASV, colour=Location, group=Sample)) + ### IMPORTANT TO GROUP BY SAMPLE
theme_bw()+
geom_point(aes(colour=Location), size=1)+
geom_line(aes(colour=Location),size=0.5)+ #Change this for line thickness
geom_vline(aes(xintercept = min(sample_sums(location_phyloseq))), lty=1, colour="black")+
scale_fill_manual(values = c("location1"= "#33CC00", "location2"= "#ffd624", "location3"="#80CC88"))+
scale_color_manual(values = c("location1"= "#33CC00","location2"= "#ffd624", "location3"="#80CC88"),
name="Location",
breaks=c("location1", "location2","location3"),
labels=c("Location 1","Location 2","Location 3"))+
labs(title="Rarefaction curves", x="Number of sequences", y="Number of ASV")+
guides(alpha="none")+
theme(legend.key=element_blank(),
legend.title.align = 0.85,
legend.title = element_text(face="bold",size=14),
axis.text = element_text(size=14),
axis.title = element_text(size = 16),
plot.title = element_text(hjust=0.5, face="bold", size=16),
legend.text = element_text(size = 12))
rareggplotAs we can see, the rarefaction process took place in the asymptote of all samples.
For saving plots in your machine, an easy and versatile option is to use ggsave form ggplot. For publication, you can save it in .TIFF, but there are also other interesting formats. For example, .svg allows you to open the file in PowerPoint and easily modify it. If you have more knowledge of softwares as Illustrator you can save it in .eps
#tiff
ggsave(filename = "rarefaction_curve.tiff", plot = rareggplot,device = tiff(),width = 18, height = 16, units = "cm", dpi = 800)
#svg
library(svglite)
ggsave(filename = "rarefaction_curve.svg", plot = rareggplot,device = svg())#> Saving 7 x 7 in image
#eps
postscript("curve_wito_RE.1.10.eps")
rareggplot
dev.off() # Close the device#> png
#> 2
Let’s continue! Now, we will compute α diversity indices. First, let’s rarefied samples, as everything worked well.
rarefaction <- rarefy_even_depth(location_phyloseq, sample.size = min(sample_sums(location_phyloseq)), rngseed = TRUE)#> `set.seed(TRUE)` was used to initialize repeatable random subsampling.
#> Please record this for your records so others can reproduce.
#> Try `set.seed(TRUE); .Random.seed` for the full vector
#> ...
#Get the seed and record it for reproducible analysis
.Random.seed[1]#> [1] 10403
##[1] 10403#Estimate alpha indices and save it
alpha_diversity_rarefied <- estimate_richness(rarefaction,measures=c("InvSimpson", "Shannon", "Observed"))
## CALCULATE EVENNESS
tASV <- as.data.frame(t(otu_table(rarefaction)))
Evenness_index <- as.data.frame(alpha_diversity_rarefied$Shannon/log(specnumber(tASV)))
Evenness <- cbind(Evenness_index, rownames(alpha_diversity_rarefied))
colnames(Evenness) <- c("Evenness", "Samples")In this step, it is worth mentioning that we have not used diversity estimates of richness. These estimates rely on singletons. However, dada2 doesn’t produce singletons as ASV, they can only appear because of the merging process (being not a real singleton for a richness estimate to rely on). For more details, check dada2 issue #92.
#Generate final table
alpha_diversity_table <- cbind(alpha_diversity_rarefied, Evenness$Evenness, metadata.micro)
colnames(alpha_diversity_table)<- c("Observed", "Shannon", "InvSimpson", "Evenness",colnames(metadata.micro))We got the final table with the diversity indices we want to study, but it is also interesting to produce a table ready for publication.
#Select numeric columns and compute mean and SD
alpha_mean <- aggregate(alpha_diversity_table[,1:4], list(grouping=alpha_diversity_table$location), mean)%>% mutate_if(is.numeric, round, digits=2)
alpha_sd <- aggregate(alpha_diversity_table[,1:4], list(grouping=alpha_diversity_table$location), sd)%>% mutate_if(is.numeric, round, digits=2)
#Paste mean ± SD
mean_sd <- NULL
for (i in 2:5){
mean_sd <- cbind(mean_sd,paste0(alpha_mean[,i], " +/- ", alpha_sd[,i]))}
#generate table and give it columns names
alpha_pub_table <- cbind(alpha_mean$grouping, mean_sd)
colnames(alpha_pub_table) <- c("Location","Observed", "Shannon", "InvSimpson", "Evenness")
as.data.frame(alpha_pub_table)#> Location Observed Shannon InvSimpson Evenness
#> 1 location1 266.5 +/- 60.34 3.74 +/- 0.35 15.3 +/- 6.89 0.67 +/- 0.04
#> 2 location2 288.62 +/- 51.55 3.98 +/- 0.36 22.96 +/- 6.99 0.7 +/- 0.05
#> 3 location3 249.88 +/- 45.05 3.78 +/- 0.28 16.4 +/- 4.25 0.69 +/- 0.04
Statistical analysis
First, we will check the homogeneity of variances and normality of data, in order to choose a proper statistical method later on. Thanks to levene.test.alpha and Shapiro functions from micro4all, we can perform these tests for every index in just one step.
levene<- levene.test.alpha(alpha_diversity_table, 4, "location")#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
#> Warning in leveneTest.default(y = y, group = group, ...): group coerced to factor.
levene#> Df F value Pr(>F) IndexColumn
#> group 2 0.4888944 0.6201127 Observed
#> 21 NA NA Observed
#> group1 2 0.2580918 0.7749391 Shannon
#> 1 21 NA NA Shannon
#> group2 2 0.4535917 0.6414220 InvSimpson
#> 2 21 NA NA InvSimpson
#> group3 2 0.2517571 0.7797466 Evenness
#> 3 21 NA NA Evenness
shapiro <- Shapiro(alpha_diversity_table, 4, "location")
shapiro#> p.value Index
#> shapiro 0.5560323 Observed
#> shapiro.1 0.6978655 Shannon
#> shapiro.2 0.5233759 InvSimpson
#> shapiro.3 0.8245797 Evenness
None of them returns a significant value. That means, according to these statistics tests, every diversity index from our data follows a normal distribution with no significant differences in variance (homoscedasticity). With this information we can perform an ANOVA test. With BalancedAnovafunction from micro4all, it is possible to run an ANOVA test for every index at once. BalancedAnova returns a list with two elements, first element is a data frame with summary results for all indices.
balanced_anova<- BalancedAnova(alpha_diversity_table, numberOfIndexes = 4, formula = "location")
balanced_anova[[1]]#> Df Sum Sq Mean Sq F value Pr(>F) IndexColum
#> data[, formula] 2 6.046583e+03 3.023292e+03 1.089031 0.35480765 Observed
#> Residuals 21 5.829875e+04 2.776131e+03 NA NA Observed
#> data[, formula]1 2 2.627533e-01 1.313767e-01 1.185111 0.32534387 Shannon
#> Residuals 1 21 2.327975e+00 1.108560e-01 NA NA Shannon
#> data[, formula]2 2 2.747742e+02 1.373871e+02 3.605388 0.04507594 InvSimpson
#> Residuals 2 21 8.002272e+02 3.810606e+01 NA NA InvSimpson
#> data[, formula]3 2 3.980461e-03 1.990230e-03 1.173966 0.32862030 Evenness
#> Residuals 3 21 3.560142e-02 1.695306e-03 NA NA Evenness
Remember that, in case you have an unbalanced experiment, you can always use the function UnbalancedAnova in a similar way. Moreover, if you prefer to use a non-parametrical test, the function KruskalWallis is also available.
Now that we have an ANOVA result, imagine we want to test differences between every two groups, that is, L1 vs L2, L2 vs L3 and L1 vs L3. Let’s do that, again for every index, with Tukey.test from micro4all.
tukey <- Tukey.test(alpha_diversity_table, 4, "location", balanced=TRUE)
tukey#> diff lwr upr p adj IndexColumn
#> location2.location1 22.12500000 -44.27816558 88.52816558 0.68300608 Observed
#> location3.location1 -16.62500000 -83.02816558 49.77816558 0.80484680 Observed
#> location3.location2 -38.75000000 -105.15316558 27.65316558 0.32461516 Observed
#> location2.location1.1 0.23770146 -0.18191097 0.65731389 0.34520237 Shannon
#> location3.location1.1 0.03584613 -0.38376631 0.45545856 0.97479481 Shannon
#> location3.location2.1 -0.20185533 -0.62146777 0.21775710 0.45910399 Shannon
#> location2.location1.2 7.66241257 -0.11734048 15.44216562 0.05404339 InvSimpson
#> location3.location1.2 1.09519735 -6.68455570 8.87495040 0.93316996 InvSimpson
#> location3.location2.2 -6.56721522 -14.34696827 1.21253784 0.10827298 InvSimpson
#> location2.location1.3 0.03147496 -0.02041613 0.08336606 0.29814208 Evenness
#> location3.location1.3 0.01391228 -0.03797881 0.06580337 0.77992656 Evenness
#> location3.location2.3 -0.01756268 -0.06945377 0.03432841 0.67493215 Evenness
Another pairwise test option is the Dunn’s test, which is less powerful but make no assumptions on data
dunn <- dunnT(alpha_diversity_table, 4, "location")#> Kruskal-Wallis rank sum test
#>
#> data: x and group
#> Kruskal-Wallis chi-squared = 1.3543, df = 2, p-value = 0.51
#>
#>
#> Comparison of x by group
#> (Benjamini-Hochberg)
#> Col Mean-|
#> Row Mean | location location
#> ---------+----------------------
#> location | -0.848712
#> | 0.2970
#> |
#> location | 0.265222 1.113935
#> | 0.3954 0.3980
#>
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
#> Kruskal-Wallis rank sum test
#>
#> data: x and group
#> Kruskal-Wallis chi-squared = 2.615, df = 2, p-value = 0.27
#>
#>
#> Comparison of x by group
#> (Benjamini-Hochberg)
#> Col Mean-|
#> Row Mean | location location
#> ---------+----------------------
#> location | -1.520279
#> | 0.1927
#> |
#> location | -0.282842 1.237436
#> | 0.3886 0.1619
#>
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
#> Kruskal-Wallis rank sum test
#>
#> data: x and group
#> Kruskal-Wallis chi-squared = 5.915, df = 2, p-value = 0.05
#>
#>
#> Comparison of x by group
#> (Benjamini-Hochberg)
#> Col Mean-|
#> Row Mean | location location
#> ---------+----------------------
#> location | -2.368807
#> | 0.0268
#> |
#> location | -0.707106 1.661700
#> | 0.2398 0.0724
#>
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
#> Kruskal-Wallis rank sum test
#>
#> data: x and group
#> Kruskal-Wallis chi-squared = 2.315, df = 2, p-value = 0.31
#>
#>
#> Comparison of x by group
#> (Benjamini-Hochberg)
#> Col Mean-|
#> Row Mean | location location
#> ---------+----------------------
#> location | -1.520279
#> | 0.1927
#> |
#> location | -0.707106 0.813172
#> | 0.2398 0.3121
#>
#> alpha = 0.05
#> Reject Ho if p <= alpha/2
dunn[[2]]#> chi2 Z P P.adjusted comparisons IndexColumn
#> 1 1.354339 -0.8487127 0.198020600 0.29703090 location1 - location2 Observed
#> 2 1.354339 0.2652227 0.395418952 0.39541895 location1 - location3 Observed
#> 3 1.354339 1.1139354 0.132653458 0.39796037 location2 - location3 Observed
#> 4 2.615000 -1.5202796 0.064220362 0.19266109 location1 - location2 Shannon
#> 5 2.615000 -0.2828427 0.388648705 0.38864871 location1 - location3 Shannon
#> 6 2.615000 1.2374369 0.107962469 0.16194370 location2 - location3 Shannon
#> 7 5.915000 -2.3688077 0.008922764 0.02676829 location1 - location2 InvSimpson
#> 8 5.915000 -0.7071068 0.239750061 0.23975006 location1 - location3 InvSimpson
#> 9 5.915000 1.6617009 0.048286377 0.07242957 location2 - location3 InvSimpson
#> 10 2.315000 -1.5202796 0.064220362 0.19266109 location1 - location2 Evenness
#> 11 2.315000 -0.7071068 0.239750061 0.23975006 location1 - location3 Evenness
#> 12 2.315000 0.8131728 0.208059496 0.31208924 location2 - location3 Evenness
When using KruskalWallis function, the pairwise test to be implemented should not be Tukey’s test, because it assumes normal distribution of data. Instead, use Dunn’s test or Mann–Whitney–Wilcoxon. micro4allalso implements looping functions for these tests
wilcoxon <- wilcoxon.test(alpha_diversity_table, 4, "location", p.adjust.method = "BH")#> Warning in wilcox.test.default(xi, xj, paired = paired, ...): cannot compute exact p-value with
#> ties
wilcoxon#> location1 location2 Index
#> location2 0.69288072 NA Observed
#> location3 0.87847708 0.6928807 Observed
#> location2.1 0.41794872 NA Shannon
#> location3.1 0.57373737 0.3911422 Shannon
#> location2.2 0.08438228 NA InvSimpson
#> location3.2 0.44180264 0.1244755 InvSimpson
#> location2.3 0.44180264 NA Evenness
#> location3.3 0.44180264 0.4418026 Evenness
Plot \(\alpha\) diversity indices
alpha_plot_table <- tidyr::gather(alpha_diversity_table, key = "Measure", value = "Value", Observed, Shannon, InvSimpson, Evenness)
alpha_graphic <- ggplot(data = alpha_plot_table, aes(x = location, y = Value)) +
facet_wrap(~Measure, scale = "free") +
geom_boxplot()+
theme(axis.text.x = element_text(size=13),
legend.position="bottom", strip.text = element_text(size = 20), axis.text.y = element_text(size=15), axis.title.y=element_text(size=20)) +
scale_x_discrete(limits=c("location1",
"location2","location3"), breaks=c("location1",
"location2","location3"), ##With breaks and labels you can change the name displayed on labels
labels=c("Location 1",
"Location 2","Location 3")) +
aes(fill=location)+ scale_fill_manual(values = c( "location1"= "#33CC00",
"location2"= "#ffd624", "location3"="#80CC88"), na.translate=FALSE) +
theme(legend.key.size = unit(1, 'cm')) +
ylab("Alpha Diversity Measure")
alpha_graphic#use ggsave for saving ggplots
#ggsave("alpha_plot.tiff", plot = alpha_graphic,width = 17, height = 30, units = "cm", dpi = 800 )Both statistical analysis and graphic show no significant differences between groups. Yet there could be distinct groups when studying β diversity, i.e. integrating abundance and taxonomic information.
\(\beta\) diversity analysis
Data preparation
For \(\beta\) diversity analysis, we first normalize data making use of edgeR8 package. This is implemented to account for differences in library sizes between samples, as well as for compositionality.
library(edgeR); packageVersion("edgeR")#> [1] '3.34.1'
edgeR <- DGEList(counts = OTU, samples = metadata.micro, genes = tax)
edgeR <- calcNormFactors(edgeR)
##EXTRACT NORMALIZED COUNTS
ASV_norm <- cpm(edgeR, normalized.lib.sizes=T, log=F)
##CREATE NORMALIZED PHYLOSEQ OBJECT
phy_OTU_norm<-otu_table(as.data.frame(ASV_norm,row.names=F), taxa_are_rows = T)
phy_taxonomy_norm<-tax_table(as.matrix(tax))
phy_metadata_norm<-sample_data(metadata.micro)
##Add taxa names
taxa_names(phy_OTU_norm)<- taxa_names(phy_taxonomy_norm)
#Check
identical(rownames(ASV_norm), rownames(tax))#> [1] TRUE
##Merge
normalized_phyloseq<-phyloseq(phy_OTU_norm,phy_taxonomy_norm,phy_metadata_norm,tree_root)Once we have a normalized phyloseq object, it is possible to analyze \(\beta\) diversity. A distance method will give us a measure of dissimilarities between replicates. Then, PERMANOVA tests wheter differences between groups are significant or not. We will make use of Permanova function from micro4all, in order to implement different kind of distance measures at once.
permanova<- Permanova(normalized_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location")
permanova[[1]]#> Df SumsOfSqs MeanSqs F.Model R2 Pr(>F) distances
#> location 2 1.4040287 0.70201437 4.510496 0.3004895 0.001 bray
#> Residuals 21 3.2684435 0.15564017 NA 0.6995105 NA bray
#> Total 23 4.6724722 NA NA 1.0000000 NA bray
#> location1 2 0.5426057 0.27130284 2.885410 0.2155638 0.005 unifrac
#> Residuals1 21 1.9745409 0.09402576 NA 0.7844362 NA unifrac
#> Total1 23 2.5171466 NA NA 1.0000000 NA unifrac
#> location2 2 0.1382008 0.06910039 2.118345 0.1678782 0.030 wunifrac
#> Residuals2 21 0.6850197 0.03261999 NA 0.8321218 NA wunifrac
#> Total2 23 0.8232205 NA NA 1.0000000 NA wunifrac
Bray-Curtis distance seems to explained the most variance (expressed as R2 in the above table). Nevertheless, another test is necessary to confirm no differences are found between dispersion. If significant differences are to be found with a betadisper test, then differences found with PERMANOVA could be only due to distinct dispersion and further graphic representation would be needed.
Again, we can loop over distances measures thanks to micro4all
betadisper<- Betadispersion(location_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location")
betadisper#> [[1]]
#> Df Sum Sq Mean Sq F N.Perm Pr(>F) distances
#> Groups 2 0.009769604 0.004884802 0.7280460 999 0.476 bray
#> Residuals 21 0.140898839 0.006709469 NA NA NA bray
#> Groups1 2 0.009311883 0.004655941 0.8646057 999 0.461 unifrac
#> Residuals1 21 0.113085961 0.005385046 NA NA NA unifrac
#> Groups2 2 0.003544450 0.001772225 0.6221726 999 0.533 wunifrac
#> Residuals2 21 0.059817359 0.002848446 NA NA NA wunifrac
#>
#> [[2]]
#> [[2]]$bray
#>
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#>
#> Response: Distances
#> Df Sum Sq Mean Sq F N.Perm Pr(>F)
#> Groups 2 0.00977 0.0048848 0.728 999 0.476
#> Residuals 21 0.14090 0.0067095
#>
#> [[2]]$unifrac
#>
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#>
#> Response: Distances
#> Df Sum Sq Mean Sq F N.Perm Pr(>F)
#> Groups 2 0.009312 0.0046559 0.8646 999 0.461
#> Residuals 21 0.113086 0.0053850
#>
#> [[2]]$wunifrac
#>
#> Permutation test for homogeneity of multivariate dispersions
#> Permutation: free
#> Number of permutations: 999
#>
#> Response: Distances
#> Df Sum Sq Mean Sq F N.Perm Pr(>F)
#> Groups 2 0.003544 0.0017722 0.6222 999 0.533
#> Residuals 21 0.059817 0.0028485
In this example, dispersion was not statistically significant different between the three groups.
But, which groups are distinct? We can make a pairwiseAdonis test to check for differences. Again, micro4allgives us the option to loop over distances
pairwise<- PairwiseAdonisFun(normalized_phyloseq,distances = c("bray", "unifrac", "wunifrac"), formula = "location", pval=0.05)
pairwise#> pairs Df SumsOfSqs F.Model R2 p.value p.adjusted sig distances
#> 1 location1 vs location2 1 0.3391022 2.076899 0.1291853 0.004 0.004 * bray
#> 2 location1 vs location3 1 0.9248550 6.081317 0.3028346 0.001 0.003 * bray
#> 3 location2 vs location3 1 0.8420859 5.555908 0.2841038 0.002 0.003 * bray
#> 4 location1 vs location2 1 0.2674323 2.821635 0.1677384 0.042 0.042 . unifrac
#> 5 location1 vs location3 1 0.2743299 2.772519 0.1653013 0.016 0.042 . unifrac
#> 6 location2 vs location3 1 0.2721463 3.080251 0.1803399 0.035 0.042 . unifrac
It seems that, according to statistical analysis used and with Bray-Curtis distances, all groups are grouped distinctively. For visualization, we can use PCoA and NMDS with every measure. Here it is a loop for producing every possible graphic. It assigns each graphic to elements of a list.
var_list <- c("bray", "unifrac", "wunifrac")
plot_type <- c("PCoA", "NMDS")
combination_plot <- purrr::cross2(plot_type,var_list )
# Make plots.
plot_list = list()
for (i in 1:length(combination_plot)){
pcoa <- ordinate(normalized_phyloseq,combination_plot[[i]][[1]],combination_plot[[i]][[2]])
plot = plot_ordination(normalized_phyloseq, pcoa, type="samples", color="location")+
geom_text(aes(label=location), hjust=0, vjust=0, show.legend=FALSE)+
geom_point(size=4)
if (combination_plot[[i]][[1]]=="NMDS"){
plot= plot + xlab(paste("Axis 1"))+
ylab(paste("Axis 2"))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="#80CC88"))+
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances", round(pcoa$stress, digits = 3)))
plot_list[[i]] = plot
}
else {
plot= plot + xlab(paste("PCo 1", paste("(",round(pcoa$values[1,2]*100,1),"%",")",sep=""),sep=" "))+
ylab(paste("PCo 2", paste("(",round(pcoa$values[2,2]*100,1),"%",")",sep=""),sep=" "))+
theme(legend.position="bottom", plot.title = element_text(face="bold", hjust = 0.5), legend.title = element_blank(), legend.key = element_blank(),
panel.border = element_rect(linetype = "solid", fill = NA),
panel.background = element_rect(fill="white", colour="white"),
panel.grid.major = element_line(colour="aliceblue", size=0.4),
panel.grid.minor= element_line(colour = "aliceblue", size =0.4))+
scale_color_manual(values=c("location1"= "#33CC00",
"location2"= "#ffd624", "location3"="#80CC88"))+
guides(color=guide_legend(nrow=2,byrow=TRUE))+
guides(shape=guide_legend(nrow=2,byrow=TRUE))+
theme(plot.title = element_text(face="bold", hjust = 0.5))+
ggtitle(paste("Bacterial Community", combination_plot[[i]][[1]], "on", combination_plot[[i]][[2]], "distances"))
plot_list[[i]] = plot
}
}
plot_list[[1]]Going deeper in the analysis, we can get a bar plot with the variance explained by every axis. However, we should be careful in this sense because we are using non-Euclidian distances. This is why we can find negative eigenvalues (imaginary dimensions). If we want to visualize corrected eigenvalues, select element Corr_eig in pcoa.
PCOA_Wunifrac <- ordinate(normalized_phyloseq, "PCoA", "wunifrac", formula="location")
length(PCOA_Wunifrac$values$Relative_eig)#> [1] 24
barplot(100*PCOA_Wunifrac$values$Relative_eig, ylim=c(0,10+100*max(PCOA_Wunifrac$values$Relative_eig)),
xlab=c("Axes"),ylab = c("Eigenvalue (%)"),axisnames = T,
axis.lty = 1)And with the next line, we can get the percentage of variance explained by the number of axes we choose.
sum(PCOA_Wunifrac$values$Relative_eig[1:10])#> [1] 0.9826669
Produce tables with relative abundance
For further analysis, or to represent data, it is useful to have tables with relative abundance at different levels. First, we usually produce a table with relative abundance by sample.
##CALCULATE RELAIVE ABUNDANCE
sample_relabun <- transform_sample_counts(location_phyloseq, function(x){x/sum(x)}*100)
##CONSTRUCT THE TABLE
OTU_sample <- as.data.frame((otu_table(sample_relabun)))
taxonomy_sample <- as.data.frame(tax_table(sample_relabun))
identical(rownames(OTU_sample),rownames(taxonomy_sample))#> [1] TRUE
sample_relabun_table <- cbind(taxonomy_sample, OTU_sample)
#SORT IT
sample_relabun_table <- sample_relabun_table[sort(sample_relabun_table$ASV_names, decreasing = FALSE),]
#Check relative abundance sums 100
colSums(sample_relabun_table[,8:ncol(sample_relabun_table)])#> L1_1 L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3
#> 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#> L3_4 L3_5 L3_6 L3_7 L3_8
#> 100 100 100 100 100
head(sample_relabun_table)#> Kingdom Phylum Class Order Family
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0003 Bacteria unclassified unclassified unclassified unclassified
#> ASV0006 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0008 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> Genus ASV_names L1_1 L1_2 L1_3 L1_4 L1_5
#> ASV0002 Streptomyces ASV0002 13.8889992 4.937848 10.893415 22.1990687 9.5526244
#> ASV0003 unclassified ASV0003 3.8330224 27.683538 26.364099 15.4691786 10.6119119
#> ASV0006 unclassified ASV0006 2.5943864 6.496807 7.246131 10.9847116 19.6124558
#> ASV0007 Pseudonocardia ASV0007 1.3855254 3.001168 2.136059 0.9440983 0.4732119
#> ASV0008 unclassified ASV0008 11.3065227 2.218254 2.586520 1.1120798 7.1675279
#> ASV0009 Promicromonospora ASV0009 0.1905594 1.504018 2.310431 1.8605541 0.1060647
#> L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5
#> ASV0002 6.9848271 20.4381884 5.620269 7.6930658 7.2420046 11.2892520 3.0031259 14.4508131
#> ASV0003 24.8232220 10.3801344 5.665003 10.5471676 3.6577360 22.6933878 7.1731339 1.5160285
#> ASV0006 6.2825092 5.3759593 1.043176 6.6385461 4.5674805 7.4010722 7.1764071 1.9043707
#> ASV0007 1.8744457 2.3142073 0.925081 2.0439942 0.3016851 0.5514424 1.8673388 0.6254551
#> ASV0008 4.2882140 3.0677164 6.514932 14.3000749 12.2581049 7.5312739 1.3452695 4.4099251
#> ASV0009 0.4674129 0.3419619 3.689588 0.2030197 7.7359552 0.1582844 0.5007937 7.1880659
#> L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 L3_4 L3_5
#> ASV0002 12.829278 8.654647 9.041784 13.6220214 5.024374 17.70502153 7.8901345 8.995795
#> ASV0003 0.921740 9.530792 9.079588 8.5899753 11.863936 10.48276173 11.3430493 3.270292
#> ASV0006 1.774789 6.913218 5.461540 0.3648316 1.406825 0.62233647 0.7892377 1.592300
#> ASV0007 1.442515 4.163499 3.624719 14.8167625 16.529120 5.98209655 9.9327354 4.862592
#> ASV0008 8.960207 1.828319 6.101981 0.0000000 0.000000 0.00000000 0.0000000 0.000000
#> ASV0009 2.551159 1.178451 1.461006 7.1783073 1.233644 0.03156779 1.8609865 21.526934
#> L3_6 L3_7 L3_8
#> ASV0002 19.96646697 19.7996690 5.78009910
#> ASV0003 14.42738258 9.9207386 21.88830856
#> ASV0006 2.58040771 0.1428447 0.82764162
#> ASV0007 16.93827060 3.8899051 13.41368689
#> ASV0008 0.00000000 0.0000000 0.00000000
#> ASV0009 0.01840227 0.5853149 0.03214142
Of course, we can do it at genus level or phylum level
#Glom phyloseq at taxonomical level
location_phyloseq_genus <- tax_glom(location_phyloseq, taxrank = "Genus")
sample_relabun_genus <- transform_sample_counts(location_phyloseq_genus, function(x){x/sum(x)}*100)
##CONSTRUCT THE TABLE
OTU_sample_genus <- as.data.frame((otu_table(sample_relabun_genus)))
taxonomy_sample_genus <- as.data.frame(tax_table(sample_relabun_genus)[,-7])
identical(rownames(OTU_sample_genus),rownames(taxonomy_sample_genus))#> [1] TRUE
sample_table_genus <- cbind(taxonomy_sample_genus, OTU_sample_genus)Furhtermore, we would like to have all unclassifieds at genus level grouped together
##Agglomerate unclassified in out table
sample_table_genus_unclass <- sample_table_genus %>% subset(Genus=="unclassified", select=c(7:ncol(sample_table_genus))) %>%
colSums() %>% t() %>% as.data.frame() %>% cbind(Kingdom="unclassified", Phylum="unclassified", Class="unclassified",
Order="unclassified", Family="unclassified", Genus="unclassified", .)
sample_table_genus_final <- rbind(subset(sample_table_genus, Genus!="unclassified"), sample_table_genus_unclass)
#Check relative abundance sums 100
colSums(sample_table_genus_final[,8:ncol(sample_table_genus_final)])#> L1_2 L1_3 L1_4 L1_5 L1_6 L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6 L2_7 L2_8 L3_1 L3_2 L3_3 L3_4
#> 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
#> L3_5 L3_6 L3_7 L3_8
#> 100 100 100 100
head(sample_table_genus_final)#> Kingdom Phylum Class Order Family
#> ASV0800 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0303 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0054 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0020 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> Genus L1_1 L1_2 L1_3 L1_4 L1_5 L1_6
#> ASV0800 Yinghuangia 0.0000000 0.000000 0.1017171 0.00000000 0.000000000 0.01438194
#> ASV0007 Pseudonocardia 6.9494621 4.721516 8.2560364 4.98628506 8.218656513 6.60370575
#> ASV0010 Amycolatopsis 7.5211402 2.812307 5.9238091 5.62844203 5.384824585 8.73462931
#> ASV0303 Saccharothrix 0.1131446 0.000000 0.1477320 0.00000000 0.000000000 0.10546753
#> ASV0054 Saccharopolyspora 0.0317599 4.937848 0.0000000 0.02551617 0.009518629 0.17018625
#> ASV0020 Actinophytocola 0.3334789 1.637937 1.0292800 3.37238725 1.419635572 0.42666411
#> L1_7 L1_8 L2_1 L2_2 L2_3 L2_4 L2_5 L2_6
#> ASV0800 0.02584596 0.00000000 0.000000 0.00000000 0.00000000 0.02782187 0.0000000 0.0000000
#> ASV0007 5.39981709 7.61894537 9.589230 4.20639635 5.54505999 3.75595306 4.0253169 6.6039394
#> ASV0010 5.12545230 7.56347630 11.692356 11.93922531 11.59050294 7.65756182 7.7873826 11.8180802
#> ASV0303 0.12127719 0.04294380 0.000000 0.02188389 0.00000000 0.28803823 0.0000000 0.0000000
#> ASV0054 5.67815818 0.06620502 0.000000 0.00000000 0.03829461 0.11456066 0.0000000 0.2460103
#> ASV0020 1.87880234 0.24871616 3.201009 1.70850658 1.77687005 1.76750732 0.9521854 0.3147015
#> L2_7 L2_8 L3_1 L3_2 L3_3 L3_4 L3_5
#> ASV0800 0.00000000 0.00000000 0.05094495 0.0000000 0.00000 0.00000000 0.08370035
#> ASV0007 9.02031063 8.99286175 22.29252260 23.2938510 11.42529 16.30269058 9.04163096
#> ASV0010 3.93903190 7.29836109 2.71980279 9.5014111 4.71713 5.62780269 4.43014010
#> ASV0303 0.05249629 0.02446130 0.00000000 0.0000000 0.00000 0.14798206 0.00000000
#> ASV0054 0.24256906 0.07116013 0.63434675 0.2501497 0.00000 0.05381166 0.00000000
#> ASV0020 1.93693204 1.49213903 0.14626130 0.9599761 0.00000 0.23542601 0.03387871
#> L3_6 L3_7 L3_8
#> ASV0800 0.00000000 0.00000000 0.0000000
#> ASV0007 25.36855665 6.38271928 22.1882952
#> ASV0010 5.16899421 2.50675028 13.2610151
#> ASV0303 0.02658106 0.05226026 0.0000000
#> ASV0054 0.00000000 1.48244926 0.2062408
#> ASV0020 0.47845912 0.00000000 2.3650730
Next, we can group it by the variable of interest, in our case, location (at genus level)
otu_genus <- sample_table_genus_final[,7:ncol(sample_table_genus_final)] %>% t() %>% as.data.frame()
#Save TAXonomy data
tax_genus <- sample_table_genus_final[,1:6]
#Calculate OTU mean abundance based on grouping factor (e.g., location)
location_mean <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t()
#Calculate OTU SD based on grouping factor (e.g., location) and change colnames
location_SD <- aggregate(otu_genus, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1") %>% t() %>%
as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_mean), "SD"))
#Merge mean abundance, SD and taxonomy.
genus_location_table <- merge(tax_genus, location_mean, by=0) %>%column_to_rownames("Row.names") %>%
merge(location_SD, by=0) %>% column_to_rownames("Row.names")
#Check abundances sum 100
colSums(genus_location_table[,7:ncol(genus_location_table)])#> location1 location2 location3 location1SD location2SD location3SD
#> 100.00000 100.00000 100.00000 51.76728 59.31652 66.66386
head(genus_location_table)#> Kingdom Phylum Class Order Family
#> 1 unclassified unclassified unclassified unclassified unclassified
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0020 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> Genus location1 location2 location3 location1SD location2SD location3SD
#> 1 unclassified 42.679895 39.071388 26.6895552 8.899015 10.4270430 8.3959878
#> ASV0002 Streptomyces 17.979426 13.409487 17.3227189 8.244194 5.5368089 9.1979191
#> ASV0007 Pseudonocardia 6.594303 6.467384 17.0369440 1.419046 2.4476910 7.2917718
#> ASV0009 Promicromonospora 1.440138 2.642542 4.0584122 1.461164 3.0727149 7.4498456
#> ASV0010 Amycolatopsis 6.086760 9.215313 5.9916308 1.837111 2.9749138 3.6426730
#> ASV0020 Actinophytocola 1.293363 1.643731 0.5273843 1.044258 0.8294456 0.8103738
We can also aggregate by location at ASV level
#Save OTU data
otu_location_ASV <- sample_relabun_table[,8:ncol(sample_relabun_table)] %>% t() %>% as.data.frame()
#Save Taxnomy data
tax_location_ASV <- sample_relabun_table[,1:7]
#Calculate OTU mean abundance based on grouping factor (e.g., PLOT)
location_ASV_mean <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=mean)%>% column_to_rownames("Group.1") %>% t()
#Calculate OTU SD based on grouping factor (e.g., PLOT) and change colnames
location_ASV_SD <- aggregate(otu_location_ASV, by=list(metadata.micro$location), FUN=sd)%>% column_to_rownames("Group.1") %>% t() %>%
as.data.frame() %>% rename_with(.fn= ~paste0(colnames(location_ASV_mean), "SD"))
#Merge mean abundance, SD and taxonomy.
ASV_location_table <- merge(tax_location_ASV, location_ASV_mean, by=0) %>%column_to_rownames("Row.names") %>%
merge(location_ASV_SD, by=0) %>% column_to_rownames("Row.names")
#Check abundances sum 100
colSums(ASV_location_table[,8:ncol(ASV_location_table)])#> location1 location2 location3 location1SD location2SD location3SD
#> 100.00000 100.00000 100.00000 111.04717 115.16294 98.63739
Taxonomical profile graphics
Usually, taxonomical profiles are interesting graphics to present when publishing data. They give a visual idea of the main taxonomical groups present in our comminities.
First, we prepare a table, so we can label genera representing less than a 1% of the sequences as ‘other genera’. Moreover, we reorder this table in a way that graphic will represent genera, unclassified and other genera (from bottom to top).
taxonomical_genus_location <- genus_location_table %>% select(6:9) %>% #select columns with genera and abundance
pivot_longer(!Genus, names_to="Samples", values_to="Abundance") %>%
mutate(Genus = case_when(Abundance<= 1.0 & Genus!="unclassified" ~ "Other genera (<1%)", TRUE ~ as.character(Genus)))
#Reorder table (first, descendant, then unclassified and other phyla)
taxonomical_genus_location <- taxonomical_genus_location %>% filter(Abundance > 1 & Genus!="unclassified") %>%
arrange(desc(Abundance))
taxonomical_genus_location <- rbind(taxonomical_genus_location, filter(taxonomical_genus_location,Genus=="unclassified"|Genus=="Other genera (<1%)"))
head(taxonomical_genus_location)#> # A tibble: 6 × 3
#> Genus Samples Abundance
#> <chr> <chr> <dbl>
#> 1 Streptomyces location1 18.0
#> 2 Streptomyces location3 17.3
#> 3 Pseudonocardia location3 17.0
#> 4 Streptomyces location2 13.4
#> 5 Amycolatopsis location2 9.22
#> 6 Pseudonocardia location1 6.59
Afterwards, we create an expression to introduce in ggplot, in a way that phylum and genera names are writting in italics.
library(gdata)#Get labels for location
location_label <- levels(as.factor(unique(taxonomical_genus_location$Samples)))
#Get unique names for genera
unique_genera <- unique(as.vector(taxonomical_genus_location$Genus))
#REORDER FACTORS LEVELS IN DATA FRAME
taxonomical_genus_location$Genus=reorder.factor(taxonomical_genus_location$Genus,new.order=rev(unique_genera))
sorted_labels_genus<- as.data.frame(unique_genera)
##CREATE AN EXPRESSION FOR GGPLOT WITH ITALICS WHEN NEEDED.
sorted_labels_ggplot <- sapply(sorted_labels_genus$unique_genera,
function(x) if (x == "Other phyla (<1%)"|x == "unclassified"|x == "Other genera (<1%)")
{parse(text=paste0("'", as.character(x), "'"))} else {parse(text = paste0("italic('",as.character(x), "')"))})When having to color a great number of objects, randomcoloR package can be very useful. You can combine it with package scales to display the random generated colors, as well as its codes.
library(randomcoloR);packageVersion("randomcoloR")#> [1] '1.1.0.1'
library(scales);packageVersion("scales")#> [1] '1.1.1'
colors_genus <- randomColor(count=length(unique_genera), hue=c("random"))
show_col(colors_genus)If you want to change some of these colors, it can be easily done as follows:
colors_genus[5] <- "#448800"In this tutorial, we will use the next palette, because it has been previously optimized to include very distinctive colors
c21 <- c(
"dodgerblue2", "#E31A1C", # red
"green4",
"#6A3D9A", # purple
"#FF7F00", # orange
"black", "gold1",
"skyblue2", "#FB9A99", # lt pink
"#CAB2D6", # lt purple
"#FDBF6F", # lt orange
"gray70", "khaki2",
"maroon", "steelblue4",
"darkturquoise", "green1", "yellow4", "yellow3",
"darkorange4", "brown"
)At last, let’s plot taxonomical profile at genus level!
ggplot_title <- "Bacterial genera by location"
ggGenera=ggplot(taxonomical_genus_location, aes(x=Samples, y=Abundance, fill=Genus, order=Genus)) + geom_bar(stat="identity", position="stack")+
scale_fill_manual(values=c21,
breaks=unique_genera,
labels=sorted_labels_ggplot)+
labs(y="Mean relative abundance (%)", x=NULL, fill="Genus",title=ggplot_title)+
guides(fill = guide_legend(reverse = TRUE))+
scale_x_discrete(limits=location_label,
labels=location_label)+
scale_y_continuous(expand=c(0.01,0.01),
breaks=c(0,10,20,30,40,50,60,70,80,90,100),
labels=c("0","10", "20","30","40","50","60","70","80","90","100"),
limits = c(NA, 100))+
theme_bw()+
theme(panel.border = element_rect(colour="black"), axis.title.x=element_blank(),
plot.title = element_text(face="bold", hjust = 0.5, size=16), axis.text = element_text(size = 14),
axis.text.x = element_text(face="bold", size=10, angle = 45, vjust=1, hjust = 1), axis.title.y = element_text(size = 16),
legend.key.size = unit(0.9, "cm"), legend.text = element_text(size = 12),
legend.title = element_text(size=14, face="bold"), legend.title.align=0.5)
ggGeneraANCOM-BC analysis
There are several methods to perform differential abundance analysis on microbial community data. Among them, we choose ANCOMBC2, because it is specifically designed for this kind of data, accounting for its compositionality. Unfortunately, when having multiple groups to compare, ancombc function only compares the first group against all others. This is a big drawback and requires the user to change the input phyloseq object for every other combination between groups. To solve this problem, micro4allhas implemented the function ancomloop. Apart from looping over groups, it also returns a table with ANCOM corrected logarithmic abundances. Let’s see how it works!
library(ANCOMBC);packageVersion("ANCOMBC")#> [1] '1.2.2'
library(microbiome);packageVersion("microbiome")#> [1] '1.14.0'
library(car);packageVersion("car")#> [1] '3.0.11'
library(ggplot2);packageVersion("ggplot2")#> [1] '3.3.5'
library(gridExtra);packageVersion("gridExtra")#> [1] '2.3'
ANCOM_location_genus <- ancomloop(input_object_phyloseq = location_phyloseq_genus, grouping = "location", ancom.options = list(global=FALSE, struc_zero=TRUE),out.unclassified = TRUE, tax.level="Genus") #When performing ancombc at genus level, we can filter unclassified genera out with out.unclassified and tax.level arguments#> [1] "location1" "location2" "location3"
#> [1] "location2" "location1" "location3"
#> [1] "location3" "location1" "location2"
head(ANCOM_location_genus[["location1"]])#> Kingdom Phylum Class Order Family
#> ASV0002 Bacteria Actinobacteria Actinobacteria Streptomycetales Streptomycetaceae
#> ASV0007 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0009 Bacteria Actinobacteria Actinobacteria Micrococcales Promicromonosporaceae
#> ASV0010 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0020 Bacteria Actinobacteria Actinobacteria Pseudonocardiales Pseudonocardiaceae
#> ASV0021 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Bradyrhizobiaceae
#> Genus ASV_names location1MeanLogAbun location2MeanLogAbun
#> ASV0002 Streptomyces <NA> 8.957076 8.759109
#> ASV0007 Pseudonocardia <NA> 8.021944 8.045645
#> ASV0009 Promicromonospora <NA> 5.952887 6.492680
#> ASV0010 Amycolatopsis <NA> 7.915451 8.407511
#> ASV0020 Actinophytocola <NA> 6.082074 6.583236
#> ASV0021 Bradyrhizobium <NA> 5.960075 6.315572
#> location3MeanLogAbun location1SD location2SD location3SD location1vslocation2_beta
#> ASV0002 8.484159 0.6493031 0.5991199 0.6372969 -0.19796656
#> ASV0007 8.498512 0.3696452 0.4329206 0.7333242 0.02370005
#> ASV0009 5.103344 1.2952050 1.3736179 2.4417865 0.53979248
#> ASV0010 7.411011 0.3356190 0.5825709 0.8391443 0.49206023
#> ASV0020 3.385740 1.2996042 0.8690411 2.8593764 0.50116212
#> ASV0021 6.344847 0.4432286 0.3798151 0.3854025 0.35549642
#> location1vslocation2_SE location1vslocation2_W location1vslocation2_pvalue
#> ASV0002 0.2921840 -0.6775408 0.49806287
#> ASV0007 0.1882653 0.1258865 0.89982180
#> ASV0009 0.6243831 0.8645212 0.38730162
#> ASV0010 0.2223526 2.2129730 0.02689951
#> ASV0020 0.5170442 0.9692829 0.33240408
#> ASV0021 0.1930422 1.8415480 0.06554129
#> location1vslocation2_padjusted location1vslocation2_diff location1vslocation3_beta
#> ASV0002 1 FALSE -0.4729166
#> ASV0007 1 FALSE 0.4765678
#> ASV0009 1 FALSE -0.8495434
#> ASV0010 1 FALSE -0.5044403
#> ASV0020 1 FALSE -2.6963336
#> ASV0021 1 FALSE 0.3847719
#> location1vslocation3_SE location1vslocation3_W location1vslocation3_pvalue
#> ASV0002 0.3008891 -1.5717306 0.116013048
#> ASV0007 0.2715929 1.7547136 0.079308315
#> ASV0009 0.9141179 -0.9293586 0.352703260
#> ASV0010 0.2988944 -1.6876875 0.091471235
#> ASV0020 1.0387420 -2.5957684 0.009437965
#> ASV0021 0.1942496 1.9808117 0.047612394
#> location1vslocation3_padjusted location1vslocation3_diff
#> ASV0002 1 FALSE
#> ASV0007 1 FALSE
#> ASV0009 1 FALSE
#> ASV0010 1 FALSE
#> ASV0020 1 FALSE
#> ASV0021 1 FALSE
We can filter these results to get only the significant ones and the comparison we would like to analyze
ANCOM_location1_genus <- ANCOM_location_genus[["location1"]]
ANCOM_loc1vsloc2_genus_sig <- ANCOM_location1_genus[,1:19] %>% filter(location1vslocation2_diff,TRUE)ANCOM-BC graphic
Graphical representation of differential analysis results can be achieved with two approaches: log fold change with corrected abundances or relative abundance without correction. Here, we will show how to accomplish both graphics. For publication purposes, we rather represent both graphics in one figure.
We will start with log fold change. First, we will bind phylum and genus name (this can be replicated when representing ASV).
#Bind phylum name to genus name
graphic_loc1vsloc2_genus<-ANCOM_loc1vsloc2_genus_sig
for (i in 1:nrow(graphic_loc1vsloc2_genus)){
graphic_loc1vsloc2_genus$Classification[i]<- paste(graphic_loc1vsloc2_genus$Phylum[i],graphic_loc1vsloc2_genus$Genus[i],sep="|")
}Next, we create a data frame with data (classification and log fold change) and filter out those genera where log fold change is 0 or less than a specific value. Remember, log fold change correspond to beta parameter in res element from ancombc function.
##Create the data frame for representation, with Classification, Log fold change and standard deviation
logfold_df <- graphic_loc1vsloc2_genus %>% select(Classification, location1vslocation2_beta, location1vslocation2_SE)
colnames(logfold_df)<- c("Genus", "LogFold_location1vslocation2", "SD")
##Filter out 0 log fold change and assign name according to the direction of change
logfold_df <- logfold_df %>%
filter(LogFold_location1vslocation2 != 0) %>% arrange(desc(LogFold_location1vslocation2)) %>%
mutate(group = ifelse(LogFold_location1vslocation2 > 0, "Location2", "Location1"))
logfold_df$Genus = factor(logfold_df$Genus, levels = logfold_df$Genus)
#FILTER WHEN NEEDED. This can be used to filter the results according to log fold change level
range(logfold_df$LogFold_location1vslocation2)#> [1] -2.926409 3.195890
logfold_df_filtered <- rbind(logfold_df[which(logfold_df$LogFold_location1vslocation2>=1.5),], logfold_df[which(logfold_df$LogFold_location1vslocation2<=-1.5),])#If not filtering, use in plotting logfold_dfNow that we have prepared all data, we can use ggplot for plotting log fold change
waterfall_location1 <- ggplot(data = logfold_df_filtered,
aes(x = Genus, y = LogFold_location1vslocation2, fill = group, color = group)) +
geom_bar(stat = "identity", width = 0.7,
position = position_dodge(width = 0.4)) +
geom_errorbar(aes(ymin = LogFold_location1vslocation2 - SD, ymax = LogFold_location1vslocation2 + SD), width = 0.2,
position = position_dodge(0.05), color = "black") +
labs(x = NULL, y = "Log fold change",
title = "Waterfall Plot for the Location Effect") +
theme_bw() +
theme(legend.position = "right",
legend.title =element_blank(),
plot.title = element_text(hjust = 0.5),
panel.grid.minor.y = element_blank(),
axis.text.x = element_text(angle = 60, hjust = 1), plot.margin=margin(10,10,10,100))
waterfall_location1As we mentioned above, we will also produce a graphic with relative abundance, specifically, a back-to-back plot.
#Again, bind phylum name and genus name
pyramidal_df<- genus_location_table
for (i in 1:nrow(pyramidal_df)){
pyramidal_df$Classification[i]<- paste(pyramidal_df$Phylum[i],pyramidal_df$Genus[i],sep="|")
}
#Now, transform it to ggplot format
pyramidal_loc1vsloc2 <- pyramidal_df %>% select(c(13, 7:8)) %>% #select columns with genera and abundance
pivot_longer(!Classification, names_to="Samples", values_to="Abundance")# %>%
#Filter pyrmaidal_loc1vsloc2 to include only significat genera according to ANCOM
pyramidal_loc1vsloc2_filt <- pyramidal_loc1vsloc2[which(pyramidal_loc1vsloc2$Classification %in%logfold_df_filtered$Genus ),]
#Split table according to location
loc1pyrm <- pyramidal_loc1vsloc2_filt[which(pyramidal_loc1vsloc2_filt$Samples=="location1"),]
loc2pyrm <- pyramidal_loc1vsloc2_filt[which(pyramidal_loc1vsloc2_filt$Samples=="location2"),]
#GRAPHIC
pyramidalggplot=ggplot(data = pyramidal_loc1vsloc2_filt, mapping= aes(x = Classification, y=Abundance, fill = Samples), colour="white")+
geom_bar(loc1pyrm, stat="identity", mapping=aes(y=-Abundance))+
geom_bar(data=loc2pyrm, stat="identity")+
theme_bw()+
scale_y_continuous(expand=c(0,0), labels=abs, limits=c(-3,3), breaks=seq(-3,3,1))+
labs(y="Relative abundance (%)")+
theme(legend.title=element_blank(), axis.title.y=element_blank(), axis.text.y= element_text(size = 8, face="bold"),
axis.text.x = element_text(size=14, face="bold"), axis.title.x = element_text(size=18, face="bold"), legend.key.size = unit(1.1, "cm"),legend.text = element_text(size = 16), panel.grid.minor.y = element_blank(), panel.grid.major.y = element_blank(),panel.grid.minor.x = element_blank())+
coord_flip()
pyramidalggplotConstrained analysis of proximities (CAP)
Microbial communities are greatly influenced by environmental factors. Thus, it is interesting to analyze how these variables relate to community changes. This can be achieved by studying different soil physicochemical parameters and fitting them to an ordination plot. The package vegan includes several functions and a very useful tutorial.
Among the options offered by vegan9, we have implemented them in a way it is logical for our daily studies. We will exemplify this with some data from pine trees (which can be downloaded in the same link as the sequences).
We will read the phyloseq file (saved as .RDS) and get the soil physicochemical parameters from metadata
#Read phyloseq and get relative abundance
cap_phyloseq <- readRDS(file = "cap_phyloseq")
cap_phyloseq_rel <- transform_sample_counts(cap_phyloseq,function(x){x/sum(x)}*100)
#Get ASV and metadata
ASV_cap<-as.data.frame(t(otu_table(cap_phyloseq_rel)))
metadata_cap <- as.matrix(sample_data(cap_phyloseq))
metadata_cap <- as.data.frame(metadata_cap)
metadata_cap <- metadata_cap[1:ncol(metadata_cap)-1] #Remove column Season_NucleicAcidNow, we will generate two models. CAP1 will include all variables to be part of constrained analysis, that is, it will explained the variance due to all variables, and CAP0, which is the opposite of CAP1, i.e., fully unconstrained.
These two models will be the input for ordistep function. The function starts with a model with all unconstrained variables (CAP0) and then, in each step, it adds a new variable to the constrained model. Then, it performs a checking step. If the variables left are no longer significant, or add no more explained variance to the model, the function stops. At the end, the output will be the formula with those variables that explained most variance of data. In this way, when having multiple variables, the model can be reduced and remove those that are interdependent.
#Compute CAP1 and CAP0 models
CAP1<-capscale(ASV_cap~Water+ln_P+SOM+N+ph_H2O+ph_KCl+CN+Na_Exch+Clay+Sand+Slime+Texture, data=metadata_cap,distance = "bray")
CAP0<-capscale(ASV_cap~1, data=metadata_cap, distance = "bray")
#Get formula with ordistep
CAP_ordi<-ordistep(CAP0, scope=formula(CAP1))#>
#> Start: ASV_cap ~ 1
#>
#> Df AIC F Pr(>F)
#> + ph_KCl 11 28 1.2601 0.005 **
#> + ph_H2O 11 30 1.0487 0.320
#> + Water 15 10 1.0896 0.345
#> + Clay 15 10 1.0955 0.350
#> + ln_P 8 33 1.0101 0.390
#> + Na_Exch 10 33 0.9478 0.710
#> + SOM 14 23 0.8758 0.840
#> + N 13 28 0.8611 0.935
#> + CN 16 -Inf
#> + Sand 16 -Inf
#> + Slime 16 -Inf
#> + Texture 16 -Inf
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Step: ASV_cap ~ ph_KCl
#>
#> Df AIC F Pr(>F)
#> - ph_KCl 11 28.737 1.2601 0.015 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Df AIC F Pr(>F)
#> + ph_H2O 3 19 0.9191 0.515
#> + Na_Exch 4 8 1.0277 0.520
#> + ln_P 4 10 0.9176 0.625
#> + Water 5 -Inf
#> + SOM 5 -Inf
#> + N 5 -Inf
#> + CN 5 -Inf
#> + Clay 5 -Inf
#> + Sand 5 -Inf
#> + Slime 5 -Inf
#> + Texture 5 -Inf
Thus, we know now which variables influence our microbial community according to ordistep. Although pH measured in H2O is a marginal result, we will also introduce it in our model because we were also interested in studying it. With theses variables, we can use the function envfit to fit these environmental factors as vectors into an ordination plot and get the correlation with each ordination axis.
ef <- envfit (CAP1~ph_KCl+ph_H2O, data = metadata_cap, perm = 999)
#Now we can adjust pvalues
ef.adj <- ef
pvals.adj <- p.adjust (ef$vectors$pvals, method = 'BH')
ef.adj$vectors$pvals <- pvals.adj
ef.adj#>
#> ***VECTORS
#>
#> $pvals
#> numeric(0)
#>
#>
#> ***FACTORS:
#>
#> Centroids:
#> CAP1 CAP2
#> ph_KCl5.6 -0.5680 -1.0802
#> ph_KCl5.7 -0.6385 -1.0075
#> ph_KCl5.8 -0.4246 -0.1969
#> ph_KCl5.9 -0.5018 0.1034
#> ph_KCl6.0 0.6368 0.3278
#> ph_KCl6.1 0.1524 0.6719
#> ph_KCl6.2 -0.3268 1.0491
#> ph_KCl6.4 -0.0529 1.3443
#> ph_KCl6.5 1.0748 0.1100
#> ph_KCl6.6 0.9464 -0.2098
#> ph_KCl7.0 0.4793 -0.3622
#> ph_KCl7.7 1.6547 -0.9666
#> ph_H2O6.4 -0.0529 1.3443
#> ph_H2O6.5 -0.5446 0.4548
#> ph_H2O6.6 -0.1920 -0.6177
#> ph_H2O6.7 -0.4580 0.6139
#> ph_H2O6.8 -0.3268 1.0491
#> ph_H2O6.9 0.1524 0.6719
#> ph_H2O7.0 -0.6357 -0.3595
#> ph_H2O7.1 1.0748 0.1100
#> ph_H2O7.3 0.6368 0.3278
#> ph_H2O7.7 0.4793 -0.3622
#> ph_H2O7.8 0.9464 -0.2098
#> ph_H2O8.3 1.6547 -0.9666
#>
#> Goodness of fit:
#> r2 Pr(>r)
#> ph_KCl 0.8180 0.079 .
#> ph_H2O 0.8141 0.073 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> Permutation: free
#> Number of permutations: 999
With plot_ordination we can get a graphic representation of ordination and fitted vectors. In this sense, we can visualize the direction of change of environmental variables.
#First, get distance matrix
distance_matrix <- phyloseq::distance(physeq = cap_phyloseq_rel, method = "wunifrac")
#Then, generate ordination with CAP (that is, constrained ordination to ph_KCl)
CAP_wunifrac <- ordinate(physeq = cap_phyloseq_rel, method = "CAP",distance = distance_matrix,
formula = ~ph_KCl+ph_H2O )
#Produce plot
CAP_wunifrac_plot <- plot_ordination(physeq = cap_phyloseq_rel, ordination = CAP_wunifrac,
color = "Species", axes = c(1,2)) +
aes(shape = Species) +
geom_point(aes(colour = Species), alpha = 1, size = 3.5) +
scale_shape_manual(values=c("Pinaster"=16,"Other"=15),
breaks=c("Pinaster","Other"))+
scale_color_manual(values = c("Pinaster"="brown", "Other"="orange"),
name="Host genotype",
breaks=c("Pinaster", "Other"),
labels=expression(paste("Confirmed ", italic("P. pinaster")), "Pine forest samples"))+
guides(shape="none")+
ggtitle("CAP on Weighted Unifrac distance")+
theme_bw()+
theme(legend.key=element_blank(),
legend.title.align = 0.85,
legend.title = element_text(face="bold"),
axis.text = element_text(size=14),
axis.title = element_text(size = 16),
plot.title = element_text(hjust=0.5, face="bold"),
legend.text = element_text(size = 16))+
geom_hline(aes(yintercept = c(0.00)), lty=2, colour="grey")+
geom_vline(aes(xintercept = c(0.00)), lty=2, colour="grey")
#Define arrows aesthetic and labels
arrowmat <- vegan::scores(CAP_wunifrac, display = "bp")
arrowdf <- data.frame(labels = rownames(arrowmat), arrowmat)
arrow_map <- aes(xend = CAP1, yend = CAP2,x = 0, y = 0, shape = NULL, color = NULL, label=labels)
label_map <- aes(x = 1.1 * CAP1, y = 1.1 * CAP2, shape = NULL,color = NULL, label=labels)
arrowhead <- arrow(length = unit(0.02, "npc"))
#Introduce them to plot
CAP_wunifrac_plot +
geom_segment(mapping = arrow_map, size = c(ph_KCl=1,ph_H2O=1), data = arrowdf, color = c(ph_KCl="black",ph_H2O="black"),arrow = arrowhead) +
geom_text(mapping = label_map, size = 4,data = arrowdf, show.legend = FALSE)References
Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016). “DADA2: High-resolution sample inference from Illumina amplicon data.” Nature Methods, 13, 581-583. doi: 10.1038/nmeth.3869 (URL: https://doi.org/10.1038/nmeth.3869).
Lin H, Peddada SD (2020). “Analysis of compositions of microbiomes with bias correction.” Nature communications, 11(1), 1–11. doi: 10.1038/s41467-020-17041-7, https://www.nature.com/articles/s41467-020-17041-7.
FIGARO: An efficient and objective tool for optimizing microbiome rRNA gene trimming parameters Michael M. Weinstein, Aishani Prem, Mingda Jin, Shuiquan Tang, Jeffrey M. Bhasin bioRxiv 610394; doi: https://doi.org/10.1101/610394
MARTIN, Marcel. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, [S.l.], v. 17, n. 1, p. pp. 10-12, may 2011. ISSN 2226-6089. Available at: http://journal.embnet.org/index.php/embnetjournal/article/view/200. Date accessed: 06 oct. 2021. doi:https://doi.org/10.14806/ej.17.1.200.
Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772-780. doi:10.1093/molbev/mst010
Price MN, Dehal PS, Arkin AP. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5(3):e9490. Published 2010 Mar 10. doi:10.1371/journal.pone.0009490.
McMurdie PJ, Holmes S. phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data. Watson M, editor. PLoS One. 2013;8: e61217. pmid:23630581
Robinson MD M D S G. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26: 139–140. pmid:19910308
Wagner H. Vegan: community ecology package. R package version 2.5.2–5. 2018. https://www.mcglinnlab.org/publication/2019-01-01_oksanen_vegan_2019/